Abstract

We propose a new statistic for the detection of differentially expressed
genes when the genes are activated only in a subset of the samples.
Statistics designed for this unconventional circumstance have proved
valuable for most cancer studies, where oncogenes are activated in only a
small number of disease samples. Previous efforts in this direction include
COPA (Tomlins and others, 2005), OS (Tibshirani and Hastie, 2006)
and ORT (Wu, 2007). We propose a new statistic called maximum ordered
subset t-statistics (MOST), which seems natural when the number of
activated samples is unknown. We compare MOST to the other statistics
and find that the proposed method often has more power than its
competitors.

Keywords: Cancer; COPA; Differential gene expression; Microarray.

1 Introduction 

The most popular method for detecting differential gene expression in
two-sample microarray studies is to compute the t-statistic for each gene.
The differentially expressed genes are those whose t-statistics exceed a
certain threshold. Recently, it has been recognized that in many cancer
studies, many genes show increased expression in disease samples, but only
in a small number of those samples. The study of Tomlins and others (2005)
shows that the t-statistic has low power in this case, and they introduced
the so-called "cancer outlier profile analysis" (COPA). Their study shows
clearly that COPA can perform better than the traditional t-statistic on
cancer microarray data sets.

More recently, further progress has been made in this direction, with the
aim of designing better statistics that account for the heterogeneous
activation pattern of cancer genes. Tibshirani and Hastie (2006) introduced
a new statistic, which they called the outlier sum (OS). Later, Wu (2007)
proposed the outlier robust t-statistic (ORT) and showed that it usually
outperformed the previously proposed statistics in both simulation studies
and an application to a real data set.

In this paper, we propose another statistic for the detection of cancer
differential gene expression, which has similar power to ORT when the
number of activated samples is very small but performs better when more
samples are differentially expressed. We call our new method the maximum
ordered subset t-statistics (MOST). Through simulation studies we found
that the new statistic outperformed the previously proposed ones under some
circumstances and was never significantly worse in any situation. Thus we
think it is a valuable addition to the collection of methods for cancer
outlier expression detection.



2 Maximum ordered subset t-statistics (MOST) 

We consider simple two-class microarray data for detecting cancer genes. We
assume there are $n$ normal samples and $m$ cancer samples. The gene
expressions for normal samples are denoted by $x_{ij}$ for genes
$i = 1, 2, \ldots, p$ and samples $j = 1, 2, \ldots, n$, while $y_{ij}$
denotes the expressions for cancer samples with $i = 1, 2, \ldots, p$ and
$j = 1, 2, \ldots, m$. In this paper, we are only interested in one-sided
tests, where the activated genes from cancer samples have a higher
expression level. The extension to two-sided tests is straightforward.

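The usual t-statistic (up to a multiplicative factor independent of genes)
for a two-sample test of differences in means is defined for each gene $i$ by

$$T_i = \frac{\bar{y}_i - \bar{x}_i}{s_i}, \tag{1}$$

where $\bar{x}_i = \sum_j x_{ij}/n$ is the average expression of gene $i$ in
normal samples, $\bar{y}_i = \sum_j y_{ij}/m$ is the average expression of
gene $i$ in cancer samples, and $s_i$ is the usual pooled standard deviation
estimate

$$s_i^2 = \frac{\sum_{1 \le j \le n} (x_{ij} - \bar{x}_i)^2 + \sum_{1 \le j \le m} (y_{ij} - \bar{y}_i)^2}{n + m - 2}.$$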
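As a minimal sketch of the t-statistic in (1), the following NumPy function computes it for all genes at once. The function name `pooled_t` and the array layout (genes in rows, samples in columns) are assumptions made for illustration and are not part of the original text.

```python
import numpy as np

def pooled_t(x, y):
    """Per-gene t-statistic of (1), up to a gene-independent factor.

    x: array of shape (p, n), normal samples (assumed layout).
    y: array of shape (p, m), cancer samples (assumed layout).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.shape[1], y.shape[1]
    xbar = x.mean(axis=1)
    ybar = y.mean(axis=1)
    # Pooled sum of squares around each group's own mean.
    ss = ((x - xbar[:, None]) ** 2).sum(axis=1) \
        + ((y - ybar[:, None]) ** 2).sum(axis=1)
    s = np.sqrt(ss / (n + m - 2))
    return (ybar - xbar) / s
```

Large positive values indicate higher average expression in the cancer samples, which matches the one-sided setting considered here.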



The t-statistic is powerful when the alternative distribution is such that
$y_{ij}$, $j = 1, 2, \ldots, m$, all come from a distribution with a higher
mean. Tomlins and others (2005) argue that for most cancer types,
heterogeneous activation patterns make the t-statistic inefficient for
detecting those expression profiles. They defined the COPA statistic

$$\mathrm{COPA}_i = \frac{q_r(\{y_{ij}\}_{1 \le j \le m}) - \mathrm{med}_i}{\mathrm{mad}_i}, \tag{2}$$

where $q_r(\cdot)$ is the $r$th percentile of the data,
$\mathrm{med}_i = \mathrm{median}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m})$
is the median of the pooled samples for gene $i$, and
$\mathrm{mad}_i = 1.4826 \times \mathrm{median}(\{|x_{ij} - \mathrm{med}_i|\}_{1 \le j \le n}, \{|y_{ij} - \mathrm{med}_i|\}_{1 \le j \le m})$
is the median absolute deviation of the pooled samples.
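The COPA statistic can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `copa`, the default `r=90`, and the use of NumPy's default (linear-interpolation) percentile convention are assumptions, and the original COPA software may use a different percentile convention.

```python
import numpy as np

def copa(x, y, r=90):
    """COPA statistic per gene.

    x: (p, n) normal samples; y: (p, m) cancer samples (assumed layout).
    r: percentile of the cancer samples used in the numerator.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y], axis=1)
    med = np.median(pooled, axis=1)
    # Median absolute deviation of the pooled samples, scaled by 1.4826.
    mad = 1.4826 * np.median(np.abs(pooled - med[:, None]), axis=1)
    return (np.percentile(y, r, axis=1) - med) / mad
```

Only a single order statistic of the cancer samples enters the numerator, so the result depends strongly on how well `r` matches the true fraction of activated samples.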

The choice of $r$ in (2) depends on the subjective judgement of the user. The
use of $\mathrm{med}_i$ and $\mathrm{mad}_i$ to replace the mean and the
standard deviation in (1) is due to robustness considerations, since it is
already known that some of the genes are differentially expressed.

In (2), only one value of $\{y_{ij}\}$ is used in the computation. A more
efficient strategy would be to use additional expression values. Let

$$O_i = \{y_{ij} : y_{ij} > q_{75}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m}) + \mathrm{IQR}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m})\} \tag{3}$$

be the outliers from the cancer samples for gene $i$, where
$\mathrm{IQR}(\cdot)$ is the interquartile range of the data. The OS
statistic from Tibshirani and Hastie (2006) is then defined as

$$\mathrm{OS}_i = \frac{\sum_{y_{ij} \in O_i} (y_{ij} - \mathrm{med}_i)}{\mathrm{mad}_i}. \tag{4}$$
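The outlier set (3) and the OS statistic (4) can be sketched as follows. The name `outlier_sum` and the array layout are illustrative assumptions, and the quartiles use NumPy's default percentile interpolation, which may differ slightly from the convention used by the original authors.

```python
import numpy as np

def outlier_sum(x, y):
    """OS statistic per gene, with outliers defined on the pooled samples.

    x: (p, n) normal samples; y: (p, m) cancer samples (assumed layout).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y], axis=1)
    med = np.median(pooled, axis=1)
    mad = 1.4826 * np.median(np.abs(pooled - med[:, None]), axis=1)
    q25, q75 = np.percentile(pooled, [25, 75], axis=1)
    cutoff = q75 + (q75 - q25)            # q75 + IQR, as in (3)
    is_outlier = y > cutoff[:, None]
    # Sum the centred cancer values that exceed the cutoff.
    num = np.where(is_outlier, y - med[:, None], 0.0).sum(axis=1)
    return num / mad
```

A gene with no cancer sample above the cutoff receives a statistic of exactly zero under this sketch.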

More recently, Wu (2007) studied the ORT statistic, which is similar to the
OS statistic. The important difference that makes ORT superior is that
outliers are defined relative to the normal samples instead of the pooled
samples. So in that definition,

$$O_i = \{y_{ij} : y_{ij} > q_{75}(\{x_{ij}\}_{1 \le j \le n}) + \mathrm{IQR}(\{x_{ij}\}_{1 \le j \le n})\}. \tag{5}$$

By similar reasoning, $\mathrm{med}_i$ in OS is replaced by
$\mathrm{med}_{ix}$ and $\mathrm{mad}_i$ by
$\mathrm{median}(\{|x_{ij} - \mathrm{med}_{ix}|\}_{1 \le j \le n}, \{|y_{ij} - \mathrm{med}_{iy}|\}_{1 \le j \le m})$,
where $\mathrm{med}_{ix}$ and $\mathrm{med}_{iy}$ are the medians of the
normal and cancer samples respectively.
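A sketch of ORT under the description above follows. The function name `ort` is a hypothetical choice, and the scale estimate is taken exactly as written in the text (without the 1.4826 factor); including that constant would only rescale every gene's statistic by the same amount.

```python
import numpy as np

def ort(x, y):
    """ORT statistic per gene, with outliers defined from normal samples only.

    x: (p, n) normal samples; y: (p, m) cancer samples (assumed layout).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    med_x = np.median(x, axis=1)
    med_y = np.median(y, axis=1)
    # Absolute deviations of each group around its own median.
    dev = np.concatenate(
        [np.abs(x - med_x[:, None]), np.abs(y - med_y[:, None])], axis=1
    )
    scale = np.median(dev, axis=1)
    q25, q75 = np.percentile(x, [25, 75], axis=1)
    cutoff = q75 + (q75 - q25)            # q75 + IQR of the normal samples
    is_outlier = y > cutoff[:, None]
    num = np.where(is_outlier, y - med_x[:, None], 0.0).sum(axis=1)
    return num / scale
```

Compared with the OS sketch, the only changes are the reference group for the cutoff and the group-wise centring in the scale estimate.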

In both the OS and ORT statistics, the outliers are defined somewhat
arbitrarily, with no convincing justification for the chosen threshold. To
address this issue, we propose the following statistic, which implicitly
considers all possible values of the outlier threshold.

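Suppose for notational simplicity that $\{y_{ij}\}_{1 \le j \le m}$ are
ordered for each $i$:

$$y_{i1} \ge y_{i2} \ge \cdots \ge y_{im}.$$

If the number $k$ of samples in which oncogenes are activated were known, we
would naturally define the statistic as

$$M_{ik} = \frac{\sum_{1 \le j \le k} (y_{ij} - \mathrm{med}_{ix})}{\mathrm{median}(\{|x_{ij} - \mathrm{med}_{ix}|\}_{1 \le j \le n}, \{|y_{ij} - \mathrm{med}_{iy}|\}_{1 \le j \le m})}. \tag{6}$$

When $k$ is not known to us, one would be tempted to define

$$M_i = \max_{1 \le k \le m} M_{ik}.$$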

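But this does not quite work, since $M_{ik}$ for different values of $k$ are
obviously not directly comparable under the null distribution
$x_{ij}, y_{ij} \sim N(0, 1)$. For example, when $m = 2$, we have
$E[y_{i1} - \mathrm{med}_{ix}] > 0$ (the expected maximum of two standard
normal variables is $1/\sqrt{\pi} \approx 0.56$), while
$E[\sum_{j=1}^{2} (y_{ij} - \mathrm{med}_{ix})] = 0$. This observation
motivates us to normalize $M_{ik}$ such that each approximately has mean 0
and variance 1. This can be achieved by defining
$\mu_k = E[\sum_{1 \le j \le k} z_j]$ and
$\sigma_k^2 = \mathrm{Var}(\sum_{1 \le j \le k} z_j)$, where
$z_1 > z_2 > \cdots > z_m$ are the order statistics of $m$ samples generated
from the standard normal distribution. Then we can define $M_{ik}$ as

$$M_{ik} = \left( \frac{\sum_{1 \le j \le k} (y_{ij} - \mathrm{med}_{ix})}{1.4826 \times \mathrm{median}(\{|x_{ij} - \mathrm{med}_{ix}|\}_{1 \le j \le n}, \{|y_{ij} - \mathrm{med}_{iy}|\}_{1 \le j \le m})} - \mu_k \right) \Big/ \sigma_k, \tag{7}$$

so that $M_{ik}$ has mean and variance approximately equal to 0 and 1
respectively. Finally we can define our new statistic (called MOST) as

$$M_i = \max_{1 \le k \le m} M_{ik}. \tag{8}$$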
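The construction of MOST can be sketched in NumPy as below. The text does not say how $\mu_k$ and $\sigma_k$ are obtained, so estimating them by Monte Carlo in `null_moments` is an assumption of this sketch, as are the function names, the number of draws, and the array layout.

```python
import numpy as np

def null_moments(m, n_draws=20000, seed=0):
    """Monte Carlo estimates of mu_k and sigma_k for k = 1, ..., m.

    Assumption: the moments of partial sums of standard normal order
    statistics are estimated by simulation rather than computed exactly.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, m))
    z = -np.sort(-z, axis=1)              # descending order within each draw
    partial = np.cumsum(z, axis=1)        # column k-1 holds z_1 + ... + z_k
    return partial.mean(axis=0), partial.std(axis=0)

def most(x, y, mu, sigma):
    """MOST statistic per gene and the maximizing k (1-based).

    x: (p, n) normal samples; y: (p, m) cancer samples (assumed layout).
    mu, sigma: arrays of length m, e.g. from null_moments(m).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    med_x = np.median(x, axis=1)
    med_y = np.median(y, axis=1)
    dev = np.concatenate(
        [np.abs(x - med_x[:, None]), np.abs(y - med_y[:, None])], axis=1
    )
    scale = 1.4826 * np.median(dev, axis=1)
    y_desc = -np.sort(-y, axis=1)         # y_i1 >= y_i2 >= ... >= y_im
    partial = np.cumsum(y_desc - med_x[:, None], axis=1) / scale[:, None]
    m_ik = (partial - mu) / sigma         # standardize each k separately
    return m_ik.max(axis=1), m_ik.argmax(axis=1) + 1
```

For data with $m = 20$ cancer samples one would call `mu, sigma = null_moments(20)` once and reuse the result for every gene, since the moments depend only on $m$.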
With MOST, we effectively consider every possible threshold above which
observations are taken to be outliers. In this formulation, the number of
outliers is implicitly defined as

$$\arg\max_{1 \le k \le m} M_{ik}. \tag{9}$$



3 Simulation studies and application 

Simulations are carried out to study MOST and compare its performance to
OS, ORT, COPA, and the t-statistic. For COPA, we choose to use the 90th
percentile in its definition, as in Tibshirani and Hastie (2006). We
generate the expression data from the standard normal distribution with
$n = m = 20$. For various values of $k$, $1 \le k \le m$, which is the
number of differentially expressed cancer samples, a constant $\mu$ is added
for differentially expressed genes. We simulated 1000 differentially and
non-differentially expressed genes, and calculated the ROC curves from them
by choosing different thresholds for gene calls.
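The simulation design described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the figures: the function names, the seed handling, the choice of simulating the same number of null and activated genes, and the use of the t-statistic as the example score are all hypothetical.

```python
import numpy as np

def simulate_genes(p=1000, n=20, m=20, k=5, mu=2.0, seed=0):
    """Return (x, y) for p null genes and (x, y) for p activated genes."""
    rng = np.random.default_rng(seed)
    null = (rng.standard_normal((p, n)), rng.standard_normal((p, m)))
    x_alt = rng.standard_normal((p, n))
    y_alt = rng.standard_normal((p, m))
    y_alt[:, :k] += mu        # shift k of the m cancer samples by mu
    return null, (x_alt, y_alt)

def t_stat(x, y):
    """Pooled two-sample t-statistic per gene (gene-independent factor dropped)."""
    n, m = x.shape[1], y.shape[1]
    ss = x.var(axis=1) * n + y.var(axis=1) * m
    return (y.mean(axis=1) - x.mean(axis=1)) / np.sqrt(ss / (n + m - 2))

def roc_curve(null_scores, alt_scores):
    """False and true positive rates over all observed score thresholds."""
    thresholds = np.sort(np.concatenate([null_scores, alt_scores]))[::-1]
    fpr = np.array([0.0] + [(null_scores >= t).mean() for t in thresholds])
    tpr = np.array([0.0] + [(alt_scores >= t).mean() for t in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

Calling `roc_curve(t_stat(*null), t_stat(*alt))` on the output of `simulate_genes` gives one curve; replacing `t_stat` with any other per-gene score function yields the curve for that statistic on the same data.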

Figures 1 and 2 plot the ROC curves for some combinations of $k$ and $\mu$.
For $\mu = 2$ and small $k$, all five statistics behave similarly, with the
t-statistic performing the worst. As $k$ increases, $t$ becomes better and
OS and COPA begin to lose power. For $\mu = 1$ and medium to large $k$, MOST
performs worse than $t$ only and better than the other statistics. Smaller
$k$ in this case basically leads to ROC curves that are close to a 45° line
for all statistics, since the signal $\mu = 1$ is too weak, so we do not show
these results. For $\mu = 4$ and small $k$, MOST is better than ORT, COPA
and $t$, and in this situation only OS is competitive with MOST. Larger $k$
in this case produces nearly perfect ROC curves for all statistics, and thus
those results are also omitted. Besides ROC curves, we have also examined
the possibility of using (9) to estimate the number of differentially
expressed samples $k$, but so far have been unable to get a reasonable
estimate out of it.

From the above simulations, we judge that our new statistic MOST is at
least as good as the other previously proposed statistics, and sometimes
much better. Thus it is a valuable tool for detecting activated genes in
many situations.

As an example of a real data application, we use the breast cancer data from
West and others (2001), which is publicly available from
http://data.cgt.duke.edu/west.php. The microarray used in the breast cancer
study contains 7129 genes and 49 tumor samples, 25 of which have no positive
lymph nodes identified and the other 24 of which have positive nodes.
Similar to Wu (2007), we take the log transformation of the expressions
after normalizing the data. We apply MOST to the data and compare it to the
t-statistic by computing the FDR using the SAM approach (Tusher and others,
2001). Figure 3 plots the FDR versus the number of genes called significant.
For this example, MOST seems to perform a little better than the
t-statistic, although the difference is too small to be of any significance.
Figure 1: ROC curves estimated based on simulation. The number of
normal/cancer samples is $n = m = 20$. Various combinations of $\mu$ and $k$
are chosen. Other uninteresting results, where all statistics have close to
perfectly good or bad performance, are excluded as explained in the main
text.

Figure 2: More ROC curves.

Figure 3: FDR versus the number of genes called significant.